11 research outputs found
Optimizing Deep Transformers for Chinese-Thai Low-Resource Translation
In this paper, we study the use of the deep Transformer translation model for
the CCMT 2022 Chinese-Thai low-resource machine translation task. We first
explore experiment settings (including the number of BPE merge operations,
dropout probability, embedding size, etc.) for the low-resource scenario with a
6-layer Transformer. Considering that increasing the number of layers also
increases the regularization on new model parameters (additional dropout
modules are introduced when using more layers), we adopt the best-performing
setting but increase the depth of the Transformer to 24 layers to obtain
improved translation quality. Our work achieves state-of-the-art performance on
Chinese-to-Thai translation in the constrained evaluation.
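The setting search described above can be pictured as a small grid sweep. The sketch below is purely illustrative (the specific BPE merge counts, dropout values, and embedding sizes are hypothetical, not taken from the paper):

```python
from itertools import product

# Hypothetical grid over the settings named in the abstract: number of BPE
# merge operations, dropout probability, and embedding size. Each config
# would be used to train a 6-layer Transformer before scaling to 24 layers.
bpe_merges = [4000, 8000, 16000]
dropouts = [0.1, 0.3, 0.5]
embed_sizes = [256, 512]

configs = [
    {"bpe_merges": b, "dropout": d, "embed_size": e}
    for b, d, e in product(bpe_merges, dropouts, embed_sizes)
]
print(len(configs))  # 18 candidate settings to evaluate
```

In practice each configuration would be scored on a held-out dev set, and only the best one carried over to the deeper 24-layer model.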
NAPG: Non-Autoregressive Program Generation for Hybrid Tabular-Textual Question Answering
Hybrid tabular-textual question answering (QA) requires reasoning from
heterogeneous information, and the types of reasoning are mainly divided into
numerical reasoning and span extraction. Although numerical reasoning is the
main challenge of the task relative to extractive QA, current methods simply
use an LSTM to autoregressively decode program sequences, and each decoding step
produces either an operator or an operand. However, the step-by-step decoding
suffers from exposure bias, and the accuracy of program generation drops
sharply with progressive decoding. In this paper, we propose a
non-autoregressive program generation framework, which facilitates program
generation in parallel. Our framework, which independently generates complete
program tuples containing both operators and operands, can significantly boost
the speed of program generation while addressing the error accumulation issue.
Our experiments on the MultiHiertt dataset show that our model brings large
improvements (+7.97 EM and +6.38 F1 points) over a strong baseline,
establishing a new state of the art while being much faster (21x) in program
generation. As the number of numerical reasoning steps increases, our method's
performance also drops significantly less than the baseline's.
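The core idea of the framework, generating each program tuple independently rather than token by token, can be sketched as follows. This is a minimal illustration, not the paper's model: the operator vocabulary, operand candidates, and logits are all toy placeholders.

```python
# Minimal sketch of non-autoregressive program generation: every position
# independently emits a complete (operator, operand, operand) tuple, so all
# positions decode in parallel and an error at one step never feeds into
# the next (avoiding the exposure bias of step-by-step LSTM decoding).
OPERATORS = ["add", "subtract", "multiply", "divide"]  # hypothetical vocabulary

def argmax(scores):
    return max(range(len(scores)), key=lambda i: scores[i])

def generate_parallel(tuple_logits):
    """tuple_logits: per position, scores for (operator, operand A, operand B)."""
    program = []
    for op_scores, a_scores, b_scores in tuple_logits:
        program.append((
            OPERATORS[argmax(op_scores)],
            argmax(a_scores),  # index into candidate operands (toy)
            argmax(b_scores),
        ))
    return program

# Toy logits for a 2-step program
logits = [
    ([0.1, 0.9, 0.0, 0.0], [0.2, 0.8], [0.9, 0.1]),
    ([0.0, 0.0, 0.7, 0.3], [0.6, 0.4], [0.1, 0.9]),
]
print(generate_parallel(logits))  # [('subtract', 1, 0), ('multiply', 0, 1)]
```

Because no tuple waits on the previous one, the loop above could run as a single batched tensor operation, which is where the reported 21x speedup comes from.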
Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue
Recent advances in Large Language Models (LLMs) have achieved remarkable
breakthroughs in understanding and responding to user intents. However, their
performance in some specialized domains, such as Chinese medicine, lags behind
that in general use cases. Existing efforts to incorporate Chinese medicine into LLMs
rely on Supervised Fine-Tuning (SFT) with single-turn and distilled dialogue
data. These models lack the ability for doctor-like proactive inquiry and
multi-turn comprehension, and cannot always align their responses with expert
standards of safety and professionalism. In this work, we introduce Zhongjing, the first
Chinese medical LLaMA-based LLM that implements an entire training pipeline
from pre-training to reinforcement learning with human feedback (RLHF).
Additionally, we introduce a Chinese multi-turn medical dialogue dataset of
70,000 authentic doctor-patient dialogues, CMtMedQA, which significantly
enhances the model's capability for complex dialogue and proactive inquiry
initiation. We define a refined annotation rule and evaluation criteria given
the biomedical domain's unique characteristics. Results show that our model
outperforms baselines in various capacities and matches the performance of
ChatGPT on a few abilities, despite using 50x less training data than the
previous best model and 100x fewer parameters than ChatGPT. RLHF further
improves the model's instruction-following ability and safety. We also release
our code, datasets, and model for further research.
Construction of cardiovascular information extraction corpus based on electronic medical records
Cardiovascular disease has a significant impact on both society and patients, making knowledge-based research, such as work that utilizes knowledge graphs and automated question answering, necessary. However, existing research on corpus construction for cardiovascular disease is relatively limited, which has hindered further knowledge-based research on this disease. Electronic medical records contain patient data that span the entire diagnosis and treatment process and include a large amount of reliable medical information. Therefore, we collected electronic medical record data related to cardiovascular disease and, drawing on relevant clinical experience, developed a standard for labeling cardiovascular electronic medical record entities and entity relations. By building a sentence-level labeling dictionary with a rule-based semi-automatic method, we constructed a cardiovascular electronic medical record entity and entity-relation labeling corpus (CVDEMRC). The CVDEMRC contains 7691 entities and 11,185 entity-relation triples, and the consistency examination yielded 93.51% and 84.02% for entity and entity-relation annotations, respectively, demonstrating good annotation consistency. The CVDEMRC constructed in this study is expected to provide a database for information extraction research related to cardiovascular diseases.
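The consistency figures quoted above come from comparing independent annotations of the same text. A minimal sketch of that kind of check, with hypothetical entity labels standing in for the corpus's actual tag set:

```python
# Toy annotation-consistency check: the fraction of items that two
# annotators label identically. The label names below are illustrative,
# not the CVDEMRC tag set.
def agreement_rate(ann_a, ann_b):
    matches = sum(1 for x, y in zip(ann_a, ann_b) if x == y)
    return matches / len(ann_a)

a = ["DISEASE", "DRUG", "SYMPTOM", "DISEASE", "TEST"]
b = ["DISEASE", "DRUG", "SYMPTOM", "DRUG", "TEST"]
print(f"{agreement_rate(a, b):.2%}")  # 80.00%
```

Real annotation studies typically report chance-corrected measures (e.g. Cohen's kappa) or span-level F1 as well, but a raw agreement percentage of this form is the simplest consistency statistic.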
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
Artificial Intelligence (AI), along with the recent progress in biomedical
language understanding, is gradually changing medical practice. With the
development of biomedical language understanding benchmarks, AI applications
are widely used in the medical field. However, most benchmarks are limited to
English, which makes it challenging to replicate many of the successes in
English for other languages. To facilitate research in this direction, we
collect real-world biomedical data and present the first Chinese Biomedical
Language Understanding Evaluation (CBLUE) benchmark: a collection of natural
language understanding tasks including named entity recognition, information
extraction, clinical diagnosis normalization, single-sentence/sentence-pair
classification, and an associated online platform for model evaluation,
comparison, and analysis. To establish evaluation on these tasks, we report
empirical results with 11 current pre-trained Chinese models; the results show
that state-of-the-art neural models still perform far worse than the human
ceiling. Our benchmark is released at
\url{https://tianchi.aliyun.com/dataset/dataDetail?dataId=95414&lang=en-us}.
Studies on a hybrid way of rules and statistics for Chinese conjunction usages recognition
Conjunctions are a kind of function word. Different conjunctions may have different usages, and the same conjunction may have different usages in different contexts. Studies on conjunction usage recognition are helpful for the automatic understanding of modern Chinese texts. This paper adopts a hybrid approach of rules and statistics to identify conjunction usages. Experimental results show that methods combining rules and statistics are helpful for the automatic recognition of conjunction usages: on the word-segmented and part-of-speech-tagged corpus of the April, May, and June 2000 People's Daily, the F-measure reaches 91.42%, 90.88%, and 90.92%, respectively, in open tests.
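A rule-then-statistics cascade of the kind described can be sketched as follows. The rules, labels, and counts below are toy placeholders (and in English for readability), not the paper's actual rule set:

```python
# Hypothetical rule-then-statistics cascade for conjunction usage
# recognition: deterministic rules fire first; otherwise fall back to the
# most frequent usage observed for that conjunction in a tagged corpus.
RULES = {
    # (conjunction, POS of following word) -> usage label (toy rules)
    ("and", "NOUN"): "coordinating-NP",
    ("but", "VERB"): "adversative",
}
CORPUS_COUNTS = {
    # usage frequencies per conjunction, as counted from training data (toy)
    "and": {"coordinating-NP": 120, "coordinating-clause": 80},
    "but": {"adversative": 200, "concessive": 15},
}

def recognize_usage(conj, next_pos):
    rule_hit = RULES.get((conj, next_pos))
    if rule_hit is not None:
        return rule_hit                      # rule-based decision
    counts = CORPUS_COUNTS.get(conj, {})
    return max(counts, key=counts.get) if counts else "unknown"  # statistical fallback

print(recognize_usage("and", "NOUN"))  # rule fires: coordinating-NP
print(recognize_usage("but", "ADV"))   # no rule; fallback picks adversative
```

The appeal of this design is that high-precision rules handle the unambiguous contexts, while corpus statistics cover the long tail of contexts no rule matches.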
The Comparative Experimental Study of Multilabel Classification for Diagnosis Assistant Based on Chinese Obstetric EMRs
Obstetric electronic medical records (EMRs) contain massive amounts of medical data and health information. Information extraction and diagnosis assistance for obstetric EMRs are of great significance for improving population fertility levels. The admitting diagnosis in the first course record of an EMR is reasoned from various sources, such as chief complaints, auxiliary examinations, and physical examinations. Based on analyses of obstetric EMRs, this paper treats the diagnosis assistant as a multilabel classification task. The latent Dirichlet allocation (LDA) topic and the word vector are used as features, and four multilabel classification methods, BP-MLL (backpropagation multilabel learning), RAkEL (RAndom k labELsets), MLkNN (multilabel k-nearest neighbor), and CC (classifier chains), are utilized to build the diagnosis assistant models. Experiments conducted on real cases show that BP-MLL achieves the best performance, with an average precision of up to 0.7413 ± 0.0100 when the number of label sets and the word dimensions are 71 and 100, respectively. The results of the diagnosis assistant can be introduced as a supplementary learning method for medical students. Additionally, the method can be used not only for obstetric EMRs but also for other medical records.
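The multilabel framing above means each record can receive several diagnoses at once. A minimal, self-contained sketch in the spirit of MLkNN (a simplified majority-vote variant over toy feature vectors, not the paper's implementation or its LDA/word-vector features):

```python
# Simplified MLkNN-style prediction: a query record receives every label
# held by a majority of its k nearest training records. Feature vectors
# and diagnosis labels below are toy examples, not real obstetric data.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mlknn_predict(train_x, train_y, query, k=3):
    neighbors = sorted(range(len(train_x)),
                       key=lambda i: euclidean(train_x[i], query))[:k]
    candidate_labels = {l for i in neighbors for l in train_y[i]}
    return {l for l in candidate_labels
            if sum(l in train_y[i] for i in neighbors) > k / 2}

train_x = [[0.1, 0.2], [0.15, 0.25], [0.9, 0.8], [0.85, 0.9]]
train_y = [{"anemia"}, {"anemia", "preeclampsia"}, {"GDM"}, {"GDM", "preeclampsia"}]
print(mlknn_predict(train_x, train_y, [0.12, 0.22]))  # {'anemia'}
```

Full MLkNN additionally estimates per-label prior and conditional probabilities from neighbor counts via a MAP rule, rather than the plain majority vote used here.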